Automated analysis of vessel morphometry in retinal images from a Danish high street optician setting

Purpose To evaluate the test performance of the QUARTZ (QUantitative Analysis of Retinal vessel Topology and siZe) software in detecting retinal features from retinal images captured by health care professionals in a Danish high street optician chain, compared with test performance from other large population studies (i.e., UK Biobank) where retinal images were captured by non-experts. Method The dataset FOREVERP (Finding Ophthalmic Risk and Evaluating the Value of Eye exams and their predictive Reliability, Pilot) contains retinal images obtained from a Danish high street optician chain. The QUARTZ algorithm utilizes both image processing and machine learning methods to determine retinal image quality, vessel segmentation, vessel width, vessel classification (arterioles or venules), and optic disc localization. Outcomes were evaluated by metrics including sensitivity, specificity, and accuracy and compared to human expert ground truths. Results QUARTZ’s performance was evaluated on a subset of 3,682 images from the FOREVERP database. 80.55% of the FOREVERP images were labelled as being of adequate quality compared to 71.53% of UK Biobank images, with a vessel segmentation sensitivity of 74.64% and specificity of 98.41% (FOREVERP) compared with a sensitivity of 69.12% and specificity of 98.88% (UK Biobank). The mean (± standard deviation) vessel width of the ground truth was 16.21 (4.73) pixels compared to that predicted by QUARTZ of 17.01 (4.49) pixels, resulting in a difference of -0.8 (1.96) pixels. The differences were stable across a range of vessels. The detection rate for optic disc localisation was similar for the two datasets. Conclusion QUARTZ showed high performance when evaluated on the FOREVERP dataset, and demonstrated robustness across datasets, providing validity to direct comparisons and pooling of retinal feature measures across data sources.


Introduction
The retina of the eye is considered a part of the central nervous system (CNS) and is said to be a window to the brain and circulatory system [1,2]. Not only do the retina and brain share anatomical and embryonic development characteristics, but the microvascular circulation and regulation in the brain and retina are also similar [1]. Viewing the vessels of the retina provides a unique opportunity to study the blood circulatory system. While systemic blood circulation can be visualized using invasive procedures such as X-ray angiography, the vessels of the retina can be captured by non-invasive fundus imaging. Changes in retinal vessel tortuosity and diameter have previously been linked to cardiovascular disease, diabetes, and glaucoma [3][4][5]. Also, CNS and systemic diseases such as ischemic brain incidents, stroke, multiple sclerosis, and Alzheimer's disease have recognized ocular manifestations [2]. Thus, it is reasonable to believe that retinal vessels may serve as a biomarker for identifying early signs of both ocular and systemic diseases and may be used as a predictor of disease development.
Retinal imaging is part of a routine eye examination when visiting an ophthalmologist. In recent years it has gained popularity in high street optician chains, as retinal imaging demands limited training and captures important signs of disease pathology such as changes in the optic nerve, macula and blood vessels of the retina [6]. The interest in using artificial intelligence (AI) in healthcare as a supplement to routine eye examinations is growing, as its ability to help clinicians manage routine tasks and analyse large amounts of data effectively has the potential to transform healthcare [7][8][9][10]. Image recognition, diagnostic classification and the search for new prognostic risk factors are of particular interest [11]. Automation and AI in ophthalmology span a multitude of approaches, from traditional image processing (unsupervised techniques, e.g., edge detection or morphological operators), to machine learning techniques (e.g., supervised and unsupervised learning) that use hand-crafted features, to deep learning (a subfield of machine learning) that can learn features automatically. Current research in AI software for automatic analysis of retinal fundus images includes studies on cardiovascular disease, diabetic retinopathy, age-related macular degeneration, retinopathy of prematurity, neonatal fundus haemorrhages, glaucoma, and retinal breaks and detachments [11][12][13][14][15][16][17][18][19][20][21].
Vessel morphometry (also known as vasculometry) is an approach for studying biomarkers of disease. This requires retinal images to be converted into quantitative measurements, including measures of vessel width, area, and tortuosity. However, this is a time-consuming task for human observers and not feasible for studies examining vasculometric associations with disease, which demand big sample sizes to generate enough power to study small, yet meaningful group differences [22]. Hence, different software programmes for automated vessel analysis have been developed [23,24], including QUARTZ (QUantitative Analysis of Retinal vessel Topology and siZe) [25]. QUARTZ converts retinal images into quantitative measures of vessel morphometry for use in epidemiological studies [26][27][28]. QUARTZ analyses the entire retina (not limited to concentric areas around the optic disc), and evaluates image quality, vessel segmentation, arteriole/venule (A/V) classification, width, area, and tortuosity measurements of retinal vessels, and localisation of the optic disc (Fig 1) [4].
QUARTZ has previously been validated on the UK Biobank dataset [27]. The UK Biobank contains data from more than 502,656 UK citizens (40-69 years of age) collected from 2006 to 2010 [29]. Of all the participants, 68,549 had retinal images (45-degree field-of-view and 2048 x 1536 pixels image size) taken at baseline [27,29]. Output data from QUARTZ has previously been combined with epidemiological data from the UK Biobank cohort in the search for potential new biomarkers of disease. Studies have investigated the associations between vessel morphometry and glaucoma as well as cardiometabolic risk factors, including its ability to predict myocardial infarction and stroke [4,30,31,32,33]. Although QUARTZ was originally developed for use in UK Biobank, it is relevant to examine the performance of QUARTZ on multiple datasets using different image capture systems with images taken by experts and non-experts, as future versions of QUARTZ may be targeted at the clinic and hence should demonstrate robustness and high performance with few limitations across datasets [7,34]. Thus, the aim of this paper was to validate QUARTZ on a further dataset with a different image acquisition protocol. The performance of QUARTZ was investigated on the FOREVERP (Finding Ophthalmic Risk and Evaluating the Value of Eye exams and their predictive Reliability, Pilot) dataset from a Danish optician chain, and QUARTZ's generalizability across datasets was examined by comparing the performance with previously published data from the UK Biobank.

Methods
The methods applied in this paper have been detailed previously by Welikala et al. [27]. The same methods were used to ensure the comparability of performance parameters across datasets.

The FOREVER dataset
Project FOREVER has been approved by the National Committee on Health Research Ethics, Denmark (project id H-21026000). The design and methodology of project FOREVER have been described thoroughly by Freiberg et al. [35]. For participants enrolling in project FOREVER, informed written consent will be collected. The FOREVER (Finding Ophthalmic Risk and Evaluating the Value of Eye exams and their predictive Reliability) dataset contains data from Danish citizens, aged above 18 years, visiting an optician shop in Denmark. The dataset includes eye examinations: visual acuity, refraction, corneal thickness, intraocular pressure, retinal images and perimetry. A subset of the FOREVER dataset contains additional data on blood pressure, saliva samples for genetic analysis and Optical Coherence Tomography (OCT) scans. As Danish citizens have a unique social security number, the FOREVER dataset can be linked to the national registries, enabling comprehensive linkage to disease risk and outcome data.
Enrolment of participants in the FOREVER cohort began in July 2022. The dataset used for validating QUARTZ consisted of images from the same Danish optician shops as in the FOREVER cohort. It comprised a subset of 3,682 images from 1,139 anonymized customers visiting an optician shop between February 2018 and May 2021. The dataset is referred to as "FOREVERP" (FOREVER, Pilot). The images were randomly selected for validation of QUARTZ and were not images from the FOREVER cohort, as validation was performed prior to the enrolment of participants in the FOREVER cohort. However, images from the FOREVERP dataset are comparable to images from the FOREVER dataset given that the image acquisition protocol is the same. Images from the FOREVERP dataset can eventually become part of the FOREVER database if FOREVERP participants decide to enrol in the FOREVER cohort by giving written consent.
The macular-centred retinal fundus images in "FOREVERP" were captured without mydriasis using digital non-mydriatic retinal cameras (Canon CR-2 AF) which incorporate Canon EOS 70D and Canon EOS 80D cameras. The retinal image photographers were trained personnel who either 1) attended a two-day course in fundus imaging, tonometry and perimetry enabling them to recognize errors and artefacts, or 2) had been trained by an optometrist. The optometrists are continuously trained with two mandatory and four optional training days per year, with a focus on specialized training in eye diseases such as glaucoma and diabetic retinopathy. The images were collected from multiple visits over several years, and the number of images varied per participant. All images had a 45-degree field of view. Images were in BMP format and of multiple image sizes ranging from 1824 x 1216 pixels to 3984 x 2656 pixels, and were resized to 3984 x 2656 pixels.

Performance parameters
The performance of the algorithm was compared with a reference standard or ground truth (GT). The GT was derived from data annotation performed by human observers (JF and RAW) [36] using purpose-built software. The performance of the algorithms was compared with the GT (e.g. comparison with labelled pixels, images, vessel segments, vessel widths etc.) and most were assessed by calculating the performance parameters of sensitivity, specificity and accuracy (Table 1) [37]. Sensitivity refers to the percentage of the positives that are correctly classified as positive (TP). Specificity refers to the percentage of negatives that are correctly classified as negative (TN). Accuracy refers to the proportion of the outcomes correctly predicted as either positive (TP) or negative (TN) [37,38].
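The three performance parameters defined above can be illustrated with a minimal sketch. The function names and the toy labels are hypothetical and serve only to show the arithmetic; they are not part of QUARTZ or of the annotation software.

```python
def confusion_counts(truth, predicted):
    """Count (TP, FP, TN, FN) for two equal-length sequences of booleans."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    tp = fp = tn = fn = 0
    for t, p in zip(truth, predicted):
        if t and p:
            tp += 1          # positive correctly classified as positive
        elif not t and p:
            fp += 1          # negative wrongly classified as positive
        elif not t and not p:
            tn += 1          # negative correctly classified as negative
        else:
            fn += 1          # positive wrongly classified as negative
    return tp, fp, tn, fn


def performance(tp, fp, tn, fn):
    """Sensitivity, specificity and accuracy as fractions between 0 and 1."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Toy example (hypothetical labels): 3 positives and 5 negatives.
truth = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, False, False, False, True, False]

counts = confusion_counts(truth, predicted)
metrics = performance(*counts)
print(counts, metrics)
```

The same functions apply whether the unit being classified is a pixel, an image or a vessel segment; only the definition of a "positive" changes.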

Automated image quality
Supervised learning (support vector machine classifier with the radial basis function kernel) along with global shape features (area, fragmentation, and complexity) measuring the segmented vessel map was used to classify images as either of inadequate or adequate quality. This approach was designed for use in epidemiological studies; hence an image can still be deemed adequate even if only a portion of the vasculature is visible [26] (Fig 2). 1,000 images were randomly selected and manually labelled by one human observer (RAW). Of the images, 826 were manually labelled as of adequate quality and 174 images as inadequate. The supervised classifier was trained with 500 images (using 5-fold cross-validation for model selection) and evaluated using a test data set of 500 images. A TP outcome equalled an image correctly classified as being of inadequate quality (Fig 2). The probability output from the classifier was normalized on a scale from 0 to 1 and flipped, to generate an image quality score (1 = highest quality).
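As a rough illustration of the final step, the sketch below turns classifier probabilities into a flipped quality score. It rests on several assumptions that are not stated in the text: that the probability is that of the "inadequate" class, that the normalization is a min-max rescaling, and that an image is labelled inadequate when its score is at or below the cut-off. The function names are hypothetical; the 0.48 cut-off is the value reported in the Results.

```python
def quality_scores(p_inadequate):
    """Rescale probabilities to [0, 1] and flip them so that 1 = highest quality.

    Assumption: min-max normalization over the supplied probabilities.
    """
    lo, hi = min(p_inadequate), max(p_inadequate)
    if hi == lo:
        raise ValueError("probabilities must not all be identical")
    return [1.0 - (p - lo) / (hi - lo) for p in p_inadequate]


def label_images(scores, threshold=0.48):
    """Label each image; the direction of the comparison (<=) is an assumption."""
    return ["inadequate" if s <= threshold else "adequate" for s in scores]


# Hypothetical classifier outputs for four images.
probabilities = [0.1, 0.3, 0.5, 0.9]
scores = quality_scores(probabilities)
labels = label_images(scores)
print(list(zip(scores, labels)))
```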

Vessel segmentation
An unsupervised approach based on a multi-scale line detector and hysteresis thresholding based morphological reconstruction was used for vessel segmentation (Fig 3) [26]. The test set consisted of 10 randomly selected images of adequate quality. Two human observers (JF, RAW) manually labelled the test set independently, creating two separate sets of 10 images. The vessel segmentation from the first human observer (RAW) constituted the GT. The segmentations made by the second human observer (JF) were considered the target performance level that the automated segmentation should aim to achieve. The performance was evaluated per pixel with and without pre- and post-processing. Pre-processing refers to the removal of pixels of bright intensities, whereas post-processing refers to the removal of the fovea and small objects falsely segmented as vessels [26,27].
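The general idea of hysteresis thresholding can be sketched as follows: strong responses act as seeds, and weaker responses are kept only if they are connected to a seed. This is a generic, minimal illustration and not the QUARTZ implementation; the function name, the 8-connectivity, the threshold values and the toy response map are all assumptions.

```python
from collections import deque


def hysteresis_threshold(response, low, high):
    """Return a boolean mask of pixels >= low that are connected to a pixel >= high."""
    rows, cols = len(response), len(response[0])
    mask = [[False] * cols for _ in range(rows)]
    queue = deque()

    # Seed the reconstruction with every strong response.
    for r in range(rows):
        for c in range(cols):
            if response[r][c] >= high:
                mask[r][c] = True
                queue.append((r, c))

    # Grow the seeds into neighbouring weak responses (8-connectivity assumed).
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows and 0 <= nc < cols
                        and not mask[nr][nc]
                        and response[nr][nc] >= low):
                    mask[nr][nc] = True
                    queue.append((nr, nc))
    return mask


# Hypothetical line-detector response: one vessel-like run and one isolated weak pixel.
response = [
    [0.1, 0.1, 0.1, 0.1, 0.6],
    [0.9, 0.6, 0.5, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.1],
]
mask = hysteresis_threshold(response, low=0.4, high=0.8)
for row in mask:
    print(row)
```

In the toy map, the weak pixels next to the strong response are retained, whereas the isolated weak pixel in the top-right corner is discarded because it has no path to a seed.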

Vessel width measurements
An unsupervised approach was used for measuring vessel widths. This included creating centrelines (segmentation thinned) and edge points (zero-crossings of the second derivative), followed by measuring the distance between edge points orthogonal to the vessel centreline orientation (Fig 4) [27]. The test set consisted of 2,150 vessel profiles from 10 images of adequate quality. Of these, 961 profiles were from normal vessel segments without a strong central reflex and with even illumination, 552 profiles showed a central reflex, and 637 profiles had low contrast or uneven illumination. Two human observers (JF, RAW) manually labelled the test set independently (Fig 5), and the mean of the two observers was used as the GT. To evaluate the agreement of measurements between QUARTZ and GT, a Bland-Altman plot was constructed.
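The quantities behind a Bland-Altman plot can be computed as in the sketch below. The difference is taken as GT minus the automated measurement, following the convention used in this paper; the use of the sample standard deviation, the conventional 1.96 SD limits of agreement, the function name and the toy widths are assumptions for illustration.

```python
from statistics import mean, stdev


def bland_altman(ground_truth, automated):
    """Return bias, SD of the differences, limits of agreement and plot points."""
    diffs = [g - a for g, a in zip(ground_truth, automated)]      # GT minus automated
    means = [(g + a) / 2 for g, a in zip(ground_truth, automated)]
    bias = mean(diffs)
    sd = stdev(diffs)                                            # sample SD (n - 1)
    return {
        "bias": bias,
        "sd": sd,
        "lower": bias - 1.96 * sd,
        "upper": bias + 1.96 * sd,
        "points": list(zip(means, diffs)),  # x = mean of pair, y = difference
    }


# Hypothetical vessel widths in pixels.
gt_widths = [10.0, 12.0, 14.0, 16.0]
auto_widths = [11.0, 12.5, 15.5, 16.0]

result = bland_altman(gt_widths, auto_widths)
print(result["bias"], result["sd"], result["lower"], result["upper"])
```

A negative bias in this convention means that the automated method measures vessels as wider than the ground truth.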

Arteriole and venule classification
Supervised learning was used for classifying vessels into arterioles or venules. This included the use of deep learning, specifically a 6-layered convolutional neural network [28]. A total of 100 images of adequate quality were randomly selected and divided into a training set, a validation set, and a test set consisting of 50, 15 and 35 images, respectively. Two human observers (JF, RAW) manually labelled 50 images each. Classification of vessels was evaluated on both pixels and vessel segments. A vessel segment refers to the part of a vessel between bifurcations and crossover points. The human observers used the following criteria for distinguishing between arterioles and venules [28,39]:
• Colour: Venules appear darker than arterioles.
• Diameter: The arterioles are thinner compared with adjacent venules.
• Central reflex: The central reflex is wider in arterioles compared with venules of comparable size [28,40].
• Branching: When labelling small vessels without colour differences or visible central reflexes, vessel branching was followed.

Optic disc localization
An unsupervised approach was used to determine the localization of the optic disc in the macula-centred fundus images. This included the use of shade correction followed by locating the maximum intensity within a constrained search region. The test set comprised 300 images of adequate quality. One human observer manually labelled the images by marking the localisation of the optic disc.
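The core of this step, finding the brightest pixel inside a restricted search region, can be sketched as follows. Shade correction is omitted, and the function name, the rectangular search region and the toy intensities are assumptions; the actual constraints used by QUARTZ are not specified here.

```python
def locate_brightest(image, row_range, col_range):
    """Return (row, col) of the maximum intensity within a rectangular search region.

    row_range and col_range are (start, stop) pairs, with stop exclusive.
    """
    best = None
    for r in range(*row_range):
        for c in range(*col_range):
            if best is None or image[r][c] > image[best[0]][best[1]]:
                best = (r, c)
    return best


# Hypothetical shade-corrected intensities: a bright artefact in the corner
# and a bright, disc-like pixel inside the search region.
image = [
    [10, 10, 10, 10, 10, 250],
    [10, 40, 10, 10, 10, 10],
    [10, 10, 200, 10, 10, 10],
    [10, 10, 10, 10, 10, 10],
]

constrained = locate_brightest(image, row_range=(1, 4), col_range=(0, 4))
unconstrained = locate_brightest(image, row_range=(0, 4), col_range=(0, 6))
print(constrained, unconstrained)
```

The example shows why the search region matters: without it, the brightest pixel in the whole image (here a corner artefact) would be selected instead.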

Results

Automated image quality assessment
The ability of QUARTZ to detect low-quality images, evaluated on the 500 test set images, was calculated to have a sensitivity of 91.95% and a specificity of 95.64%. This equates to 80.40% of all images in the test set being labelled as of adequate quality (TN and FN), and of these, 98.26% were correctly labelled as of adequate quality (TN). When applying the automated algorithm to the full subset of 3,682 images, 80.55% (2,966 images) were labelled as adequate quality, with 95.17% of the participants having at least one image labelled as of adequate quality (Table 2). As these numbers include images from several years, they may overestimate the actual number of participants with an image of adequate quality. Evaluating images from one year (2021) showed consistency in image quality, with 93.94% of the participants having at least one image labelled as adequate. The performances stated above equated to images being labelled as inadequate if the quality score was ≤ 0.48.
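The relationship between these percentages can be checked with a small worked example. The confusion-matrix counts below are not reported in the paper; they are back-calculated as one set of whole numbers that reproduces the stated percentages for a 500-image test set, with "inadequate quality" as the positive class.

```python
# Inferred (not reported) counts for the 500-image test set.
tp = 80    # inadequate images correctly flagged as inadequate
fn = 7     # inadequate images wrongly labelled adequate
tn = 395   # adequate images correctly labelled adequate
fp = 18    # adequate images wrongly flagged as inadequate

total = tp + fn + tn + fp

sensitivity = 100 * tp / (tp + fn)            # share of inadequate images detected
specificity = 100 * tn / (tn + fp)            # share of adequate images retained
labelled_adequate = 100 * (tn + fn) / total   # share of all images labelled adequate
correct_among_adequate = 100 * tn / (tn + fn)  # share of those that truly are adequate

print(f"{sensitivity:.2f} {specificity:.2f} "
      f"{labelled_adequate:.2f} {correct_among_adequate:.2f}")
```

Under these assumed counts, the four values round to 91.95, 95.64, 80.40 and 98.26, matching the figures reported above.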

Vessel segmentation
The sensitivity of vessel segmentation performed by QUARTZ was 80.31% without pre/post-processing and 74.64% with pre/post-processing (Table 3). The specificity was 98.02% and 98.41% without and with pre/post-processing, respectively. Compared with the GT, the second human observer (JF) achieved a sensitivity of 75.07% and a specificity of 98.29%. Thus, the performance of QUARTZ was comparable to that achieved by the second human observer.

Vessel width measurements
In general, QUARTZ measured vessels as having a width greater than the GT (Table 4). The mean difference (± standard deviation), calculated as the GT minus QUARTZ, varied from -1.40 (2.23) pixels (vessels with low contrast/uneven illumination) to -0.37 (1.91) pixels (normal vessel segments). The overall difference for all vessel profiles was -0.80 (1.96) pixels. The correlation coefficient between the GT and QUARTZ was 0.9111, as illustrated by the scatter plot in Fig 6. The differences between the GT and QUARTZ were overall stable across vessel widths, showing minor linear patterns that might be vessel segment specific, as visualised by the Bland-Altman plot (Fig 6). However, there appears to be little systematic error. A low variance in the differences between the GT and QUARTZ indicates that widths are measured consistently. A low variance is thus more important than the absolute difference between the GT and QUARTZ [27].

Arteriole and venule classification
Since arterioles are typically accompanied by an adjacent venule, the two classes (arteriole and venule) are approximately balanced, and assessing accuracy is therefore sufficient. The accuracy of arteriole and venule classification was evaluated per pixel and per segment. The accuracy was 89.25% per pixel and 86.32% per segment for both arterioles and venules.
Other performance measurements are provided in Table 5. By increasing the probability threshold for arteriole/venule classification, the sensitivity, specificity, and accuracy were improved. However, increasing the threshold and thereby improving the performance of the algorithm resulted in a loss of data (Table 6). Increasing the probability threshold to >0.8 more than halved the dataset. A threshold of >0.9 increased the sensitivity and accuracy to exceed 99%, but at the cost of approximately 2/3 of the dataset. A threshold of 0.8 has been chosen in previous epidemiological studies as an appropriate threshold ensuring both high data volume and performance of the algorithm [3].
Table 4. FOREVERP width measurements. Width measurements in pixels listed as mean (μ) ± standard deviation (σ) for the following vessel segments: normal vessel segment, vessels with central reflex, vessels with low contrast/uneven illumination and an average of all vessel profiles measured by human observer 1, human observer 2, ground truth and QUARTZ. Also, the differences μ (σ) between ground truth and QUARTZ are listed for all vessel profiles addressed.

Optic disc localization
Of the 300 randomly selected images, QUARTZ demonstrated a detection rate of 97.33% in terms of correctly identifying the location of the optic disc.

Discussion
Overall, QUARTZ demonstrated high test performance in measuring retinal vessel width on the FOREVERP dataset, exhibiting good agreement with ground truth measures when compared with previously published data from UK Biobank [27]. It is of great importance to show similar test performance of QUARTZ in a different geographic population using a different image acquisition system, as it is a recognized challenge for automated retinal image analysis software to perform well on multiple datasets that vary in population and protocols [41]. Poor agreement between vasculometry measures using different software may limit the homogeneity of the associations between retinal vasculometry and disease across studies. The ability to show high performance across different datasets is highly relevant as it ensures the validity of the measurements and thus provides high quality input for future epidemiological studies investigating associations between vessel morphology and risk factors for disease.
High image quality is crucial for the reliability of the other parameters evaluated by the algorithm. The image quality of the FOREVERP dataset exceeded the image quality of the UK Biobank dataset, with 80.55% (FOREVERP) compared with 71.53% (UK Biobank) of the images labelled as adequate [27]. There may be various reasons that cause reduced image quality, including technical parameters such as poor illumination, lens artefacts, defocus, or blur. Other participant or operator-related factors may affect image quality by causing obstructions or multiple falsely segmented non-vessel objects, such as the capture of eyelashes, visible choroid layer, exudates, drusen, haemorrhages, retinitis pigmentosa, retinal scars, asteroid hyalosis or inflammation of the optic disc [26,42]. The high quality of the FOREVERP dataset compared with the UK Biobank dataset may partly be explained by better trained retinal photographers. In UK Biobank, the personnel had basic training in fundus imaging [26], whereas the camera operators in the FOREVERP dataset were trained personnel supervised by optometrists.
The segmentation of retinal vessels is challenging due to their complex structure and heterogeneity in terms of shape, size, and intensity. Moreover, poor illumination and contrast may further complicate distinguishing small vessels from background noise [43]. In the search for new retinal vascular risk factors, it is particularly important to ensure that the algorithm correctly avoids non-vessel objects. A high specificity ensures that the marked vessels are correctly identified. The specificity of the overall vessel segmentation was similar for the two datasets, whereas the sensitivity and accuracy of vessel segmentation were improved in the FOREVERP dataset compared with the UK Biobank dataset (Table 7). By applying pre/post-processing to the images, the specificity of the FOREVERP (98.41%) and UK Biobank dataset (98.88%) increased, at the cost of reduced sensitivity. Also, the performance parameters of arteriole and venule classification were improved in the FOREVERP dataset compared with UK Biobank. Exceptions were the specificity of arteriole classification/sensitivity of venule classification evaluated per segment (Table 7). With the consistently high levels of both sensitivity and specificity, there is potential for future epidemiological and/or clinical studies using the FOREVER dataset. In that case, pre/post-processing of the images would be preferable to increase the certainty of correctly segmented retinal vessels.
Table 6. Arteriole and venule classification. Sensitivity of arteriole and venule classification and the arteriole/venule accuracy of the FOREVERP dataset reported for the following thresholds: >0.5, >0.6, >0.7, >0.8, >0.9. The fraction of data retained for every threshold is reported alongside the performance parameters.
In the FOREVERP dataset, the mean (± standard deviation) difference between the GT data and the width predicted by QUARTZ was -0.80 (1.96) pixels, showing that QUARTZ detected vessels as having a width larger than the GT. In the UK Biobank dataset, the difference between the GT and QUARTZ was 0.70 (1.13) [27], showing that QUARTZ detected vessels as having a width smaller than the GT (Table 8). The width measurements are very stable within each dataset, and the discrepancy between datasets is likely explained by the subjective nature of human annotation. The larger image size of the FOREVERP dataset (1.75x that of UK Biobank) may alter how the human observers perceive the vessel edges, causing differences in how they annotate vessel widths between the two datasets. The magnitude of the mean difference is similar in both datasets (4.9% for FOREVERP and 6.4% for UK Biobank) when expressed relative to the mean vessel profile width.
The detection rate of localising the optic disc correctly was similar for the two datasets with a detection rate of 97.33% for the FOREVERP dataset compared with 97.60% for UK Biobank [27].
As shown in this paper, QUARTZ demonstrated a high performance across two different datasets. Whilst this demonstrates robustness, calibration and training on each dataset were required. Therefore, a future generic version of QUARTZ would be the next step, compatible across multiple retinal datasets with no or minimal calibration. To achieve a more universal system, further validation of QUARTZ in multi-ethnic populations would be necessary, as they may present differences in retinal pigmentation and disease risk profiles compared with Danish and British populations. All the modules/components of this next version will be more heavily driven by AI, specifically deep learning. Such a system would make widespread research or adoption to predict the risk of disease based on fundus image analysis more feasible. An alternative approach to vessel morphometry is end-to-end AI, in which a single deep learning model directly maps the input images to the risk of disease [42]. A wider application of AI in ophthalmology is promising, as ophthalmology is highly dependent on imaging modalities for diagnosis as well as monitoring disease progression [44]. AI has made large strides in recent years due to the rise of deep learning, for example, to predict the risk of disease or identify pathology [36,45], such as "IDx-DR", which has been approved by the U.S. Food and Drug Administration (FDA) to screen for diabetic retinopathy [46,47]. However, AI (particularly end-to-end) approaches have mainly been developed for research purposes, as the method is challenging to implement in a clinical setting. This is partly due to the complexity of the analysis performed by the algorithm, which makes it difficult to assess how the input data relates to the final output data [44,48,49,50]. To visualize the parameters on which the algorithm has based its decisions, heatmaps can be developed, thus increasing transparency [49]. Nevertheless, it remains a challenge for the future implementation of AI in healthcare that the algorithms are not based on medical reasoning [49,51].
AI algorithms can be trained with publicly available datasets such as DRIVE (Digital Retinal Images for Vessel Extraction) [52], STARE (Structured Analysis of the Retina) [53], HRF (High-Resolution Fundus Image Database) [54] and CHASE_DB1 (a subset of retinal images from the Child Heart and Health Study in England) [55] or larger datasets such as UK Biobank [56] and EPIC (European Prospective Investigation into Cancer and Nutrition) [57], available to researchers upon approval of access to the databases. High-quality data are important to ensure the reliability of an algorithm, and the training and test data should match the data from the clinical population of interest [7]. AI methods rely on access to representative and comprehensive training datasets [11,49], and approaches that use deep learning often require large datasets to avoid overfitting models. Although algorithms perform well on a public dataset, there is no guarantee for a high performance on the clinical data of interest, as the datasets may vary both in terms of the technical settings and the population being studied. If the publicly available datasets are not suitable for the clinical data of interest, researchers may need to create larger datasets from appropriate settings, as in the case with the FOREVER dataset. With linkage to data from the comprehensive national registries, the FOREVER dataset can in future be used for large-scale epidemiological research.

Conclusion
In conclusion, QUARTZ exhibited high performance on a subset of 3,682 retinal images from the FOREVERP dataset. Compared with previously published data from the UK Biobank, QUARTZ's performance on the FOREVERP dataset was at the same level or higher. QUARTZ has thereby shown robustness across datasets, enabling future research linking vessel morphometry with epidemiological data aiming to find novel biomarkers of disease.

would like to thank "Fight for Sight Denmark" for providing valuable insight into patient per-